Abstract

We address the problem of synthetic gene design using Bayesian optimization.
The main issue when designing a gene is that the design space is defined in terms
of long strings of characters of different lengths, which renders the optimization
intractable. We propose a three-step approach to deal with this issue. First, we
use a Gaussian process model to emulate the behavior of the cell. As inputs of
the model, we use a set of biologically meaningful gene features, which allows
us to define optimal gene design rules. Second, based on the model outputs, we define
a multi-task acquisition function to simultaneously optimize several aspects of
interest. Finally, we define an evaluation function, which allows us to rank sets of
candidate gene sequences that are coherent with the optimal design strategy. We
illustrate the performance of this approach in a real gene design experiment with
mammalian cells.


1 Introduction 

Synthetic biology is concerned with the design and construction of new biological elements of living
systems and the re-design of existing ones for useful purposes. In this context, there is current
interest in the development of new methods to engineer living cells in order to produce compounds
of interest. A modern approach to this problem is the use of synthetic genes, which, once
'inserted' in the cells, can modify their natural behavior, activating the production of proteins useful
for further pharmaceutical purposes.

We present the first approach for gene design based on Bayesian optimization (BO) principles. The
BO framework [14] allows us to explore the gene design space in order to provide rules to
build genes with interesting properties, such as genes that are able to produce proteins of interest,
or genes able to act on the cell lifespan. We use a Gaussian process [12] to emulate the complex
behavior of the cell across the different gene designs. An acquisition function is optimized to deal
with the exploration-exploitation trade-off. To provide not only rules for gene design but also actual
candidate gene sequences, we introduce the concept of an evaluation function. The goal of this function
is to avoid the bottleneck of optimizing over the sequences by providing rules to rank biologically
feasible genes that are coherent with the obtained gene design rules.

Although in this work we focus on the optimization of the translational efficiency of the cells, this
framework can be generalized to multiple synthetic biology design problems, such as the optimization
of media design or the optimization of multiple gene knock-out strategies.

2 Rewriting the Genetic Code 

Figure 1: Top: Central dogma of molecular biology. Each gene contains the information to encode
an mRNA molecule, which is used by the cell to produce proteins. Both mRNA molecules and
proteins are produced at a certain rate. The goal is to design gene sequences able to increase the rates
while encoding the same protein. Bottom left: Graph of codon redundancy. Letters inside the circle
represent DNA bases and letters outside represent amino acids. The arcs outside the circle cover the
paths of redundant codons; f and * represent special codons. Bottom right: Graphical model of the
multi-output Gaussian process used to emulate the cell behavior in this work.

Broadly speaking, in molecular biology it is assumed that a gene contains the information to encode
an mRNA molecule. The production of such molecules is called transcription and takes place in
the cell nucleus at a certain rate $y_\alpha$. Later on, the mRNA molecules are used to produce proteins at
a different rate $y_\beta$, in what is called the translation phase. A one-to-one correspondence between
genes and proteins is assumed. Each gene is itself made up of a sequence $s$ of several hundred
bases (A, T, G, C), triplets of which form the so-called codons. We can interpret the codons as
the 'words' in which the genetic code is written. The 64 possible codons encode 20 amino acids,
which are the fundamental elements that the cell uses to produce proteins. This means that the
genetic code is redundant: the same amino acid can be encoded by different codons, and therefore
there exist multiple ways of encoding the same protein. See Figure 1 (top and bottom left) for an
illustration of this process. A fundamental principle of gene design is that redundant codon choices do not
affect the type of protein that is being encoded, but they may affect the rates $y_\alpha$, $y_\beta$ and therefore
the efficiency at which it is produced.
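
The redundancy described above can be made concrete with a short sketch. It assumes the standard genetic code (NCBI translation table 1); the helper names `translate` and `count_encodings` are illustrative and do not come from the paper.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate a DNA string whose length is a multiple of 3 into amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_encodings(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(len(SYNONYMS[aa]) for aa in protein)


if __name__ == "__main__":
    # Methionine (M) has one codon, lysine (K) two, leucine (L) six.
    print(translate("ATGAAACTG"), count_encodings("MKL"))
```

The count grows multiplicatively with the protein length, which is why searching directly over sequences quickly becomes impractical.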

Consider a $p$-dimensional representation $x \in \mathbb{R}^p$ of a gene sequence $s$. Such a representation
will typically be the frequency of the different codons, but it may also include other variables such as
the length of the sequence or the number of times a certain pattern is repeated across the gene. Denote by
$f_\alpha, f_\beta : \mathcal{X} \to \mathbb{R} \times \mathbb{R}$ the functions representing the expected transcription and translation rates
given a sequence with features $x \in \mathcal{X}$. We want to solve the global optimization problem of finding
the sequence that maximizes both rates. However, this requires optimization across all possible
sequences, which is infeasible due to the high dimensionality. Instead, we aim to solve the surrogate
multi-objective problem of finding $x^* = \arg\max_{x \in \mathcal{X}} (f_\alpha(x), f_\beta(x))$, to later connect $x^*$ with a
particular gene design.
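
As a rough illustration of such a feature map, the sketch below turns a DNA string into codon frequencies plus two extra variables. The function name `gene_features`, the codon ordering and the choice of sequence length and GC content as extra features are assumptions made for this example, not the authors' exact pipeline.

```python
from collections import Counter
from itertools import product

# Fixed codon order so that every gene maps to the same feature layout.
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]


def gene_features(dna):
    """Return [64 codon frequencies..., sequence length, GC content]."""
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    n_codons = max(len(codons), 1)            # avoid division by zero
    freqs = [counts[c] / n_codons for c in CODONS]
    gc_content = sum(dna.count(b) for b in "GC") / max(len(dna), 1)
    return freqs + [float(len(dna)), gc_content]


if __name__ == "__main__":
    x = gene_features("ATGGCCGCC")
    print(len(x), x[CODONS.index("GCC")], x[-2], x[-1])
```

In practice the features would probably need rescaling before being passed to a kernel, since the length and the frequencies live on very different scales.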

3 Method used 

3.1 Multi-output Gaussian Processes as Cell-behavior Surrogate Model 

Let $\{s_1, \dots, s_N\}$ be a set of gene sequences. Consider a $p$-dimensional feature representation of the
sequences given by $\{x_1, \dots, x_N\}$, where $x_i \in \mathbb{R}^p$. Let $y_\alpha, y_\beta \in \mathbb{R}^N$ be the observed transcription
and translation rates. Our first goal is to learn, from a model of the combined data $\mathcal{D} = \{\mathcal{D}_\alpha, \mathcal{D}_\beta\}$,
where $\mathcal{D}_\alpha = \{x_i, y_{\alpha,i}\}_{i=1}^N$ and $\mathcal{D}_\beta = \{x_i, y_{\beta,i}\}_{i=1}^N$, how to predict the value of the output functions
$f_\alpha(x)$ and $f_\beta(x)$ at any $x \in \mathbb{R}^p$. For simplicity we assume here that both rates are available for all
the sequences, but this assumption can be easily relaxed.
Algorithm 1 Bayesian optimization for gene design

Extract features $\{x_1, \dots, x_N\}$ from the available gene sequences $\{s_1, \dots, s_N\}$.
Take $\mathcal{D}_1 = \{\mathcal{D}_\alpha, \mathcal{D}_\beta\}$, where $\mathcal{D}_\alpha = \{x_i, y_{\alpha,i}\}_{i=1}^N$ and $\mathcal{D}_\beta = \{x_i, y_{\beta,i}\}_{i=1}^N$.
for $t = 1, 2, \dots$ do
    Fit a multi-output GP model using $\mathcal{D}_t$.
    Obtain design rules by taking $x_{t+1} = \arg\max_{x \in \mathcal{X}} acq(x \mid \mathcal{D}_t)$.
    Generate a set of candidate gene sequences $S$.
    Rank the sequences in $S$ and select $s_{t+1} = \arg\min_{s \in S} eval(s \mid x_{t+1})$.
    Run the experiment using $s_{t+1}$ and extract features $x_{t+1}$ from the sequence $s_{t+1}$.
    Augment the data: $\mathcal{D}_{t+1} = \{\mathcal{D}_t, (x_{t+1}, (y_{\alpha,t+1}, y_{\beta,t+1}))\}$.
end for
Returns: Optimal gene design $s^*$.


A Gaussian process (GP) is a stochastic process with the property that each linear finite-dimensional
restriction is multivariate Gaussian [12]. GPs are typically used as prior distributions over functions.
In the single-output case, the random variables are associated with a single process $f$ evaluated at
different $x$, but this can be easily generalized to multiple outputs, where the random variables are
associated with different processes $\{f_l\}_{l=1}^d$. In our case we have $d = 2$, which corresponds to the two
rates of interest. We therefore work with the vector-valued function $\mathbf{f} := (f_\alpha, f_\beta)$, which is assumed
to follow a GP, $\mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K})$, where $\mathbf{m}$ is a 2-dimensional vector whose components are the mean
functions $m_\alpha$, $m_\beta$ of each output and $\mathbf{K}$ is a positive matrix-valued function that acts directly on input
examples and task indices. The entries $(\mathbf{K}(x, x'))_{\alpha,\beta}$ of $\mathbf{K}(x, x')$ represent the covariance between
$f_\alpha(x)$ and $f_\beta(x')$. Under a Gaussian likelihood assumption, the predictive distribution for a new
vector $x_*$ is taken to be Gaussian, such that
$p(\mathbf{f}(x_*) \mid \mathcal{D}, \mathbf{f}, x_*, \phi) = \mathcal{N}(\mathbf{f}_*(x_*), \mathbf{K}_*(x_*, x_*))$, where
$\mathbf{f}_*(x_*)$ and $\mathbf{K}_*(x_*, x_*)$ are closed-form expressions that depend on the set of inputs $X$ and the kernel $\mathbf{K}$.
Here $\phi$ represents all the parameters of the kernel, which can be built following various
strategies. In this work we use a combination of the linear and the intrinsic coregionalization models.
We take $\mathbf{K}(X, X) = B_{lin} \odot K_{lin}(X, X) + B_{se} \odot K_{se}(X, X)$, where $K_{lin}$ is a linear kernel used
to account for the different levels of the rates and $K_{se}$ is a squared exponential kernel with a different
lengthscale per dimension. $B_{lin}$ and $B_{se}$ are the coregionalization matrices, which are parametrized as
$B_{lin} = w_{lin} w_{lin}^\top + \kappa_{lin} I_2$ and $B_{se} = w_{se} w_{se}^\top + \kappa_{se} I_2$ for $w_{lin}, w_{se}, \kappa_{lin}, \kappa_{se} \in \mathbb{R}^2$, where $I_2$ is the
identity matrix of dimension 2 and $\odot$ represents the Hadamard product. See Figure 1 (bottom right) for
a graphical description of the model.
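
A minimal sketch of how one entry of such a two-output covariance could be computed is given below. It treats each $\kappa$ as a single non-negative scalar added to the diagonal, which is a simplifying assumption, and all function names are illustrative rather than taken from any library.

```python
import math


def linear_kernel(x, z):
    """Plain dot-product kernel."""
    return sum(a * b for a, b in zip(x, z))


def se_ard_kernel(x, z, lengthscales):
    """Squared exponential kernel with one lengthscale per input dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, z, lengthscales))
    return math.exp(-0.5 * sq)


def coregionalization_matrix(w, kappa):
    """B = w w^T + kappa * I_2 (kappa taken as a scalar in this sketch)."""
    return [[w[i] * w[j] + (kappa if i == j else 0.0) for j in range(2)]
            for i in range(2)]


def multi_output_kernel(x, i, z, j, B_lin, B_se, lengthscales):
    """Covariance between output i at x and output j at z.

    Output index 0 stands for transcription, 1 for translation.
    """
    return (B_lin[i][j] * linear_kernel(x, z)
            + B_se[i][j] * se_ard_kernel(x, z, lengthscales))


if __name__ == "__main__":
    B_lin = coregionalization_matrix([1.0, 2.0], 0.5)
    B_se = coregionalization_matrix([0.5, 0.5], 0.1)
    x, z = [1.0, 0.0], [0.5, 0.5]
    print(multi_output_kernel(x, 0, z, 1, B_lin, B_se, [1.0, 2.0]))
```

The off-diagonal entries of the two $B$ matrices are what let observations of one rate inform predictions of the other.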

3.2 Acquisition and Evaluation Functions 

In multi-task optimization problems a typical issue is how to deal with potentially conflicting objectives,
or tasks that cannot be optimized simultaneously; in our case this means that both rates cannot be
optimized simultaneously using the same sequence. Following previous work in multi-task Bayesian
optimization, here we focus on an acquisition function that maximizes the average of the tasks.
The predictive mean and variance of the average objective are
$m(x) = \frac{1}{2} \sum_{l=\alpha,\beta} f_*^{l}(x)$ and $\sigma^2(x) = \frac{1}{2^2} \sum_{l,l'=\alpha,\beta} (\mathbf{K}_*(x, x))_{l,l'}$.
Both $m(x)$ and $\sigma^2(x)$ can be used in a standard way with any
acquisition function $acq(x)$, such as the expected improvement (EI) [14].
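
The following sketch shows how the mean and variance of the averaged objective might be computed from a joint two-output prediction and then passed to the usual closed-form expected improvement for maximization. The function names and the `best` argument (the best average value observed so far) are assumptions for this example.

```python
import math


def average_objective(mean, cov):
    """Mean and variance of (f_alpha + f_beta) / 2.

    mean: [m_alpha, m_beta]; cov: 2x2 predictive covariance at the same x.
    """
    m = 0.5 * (mean[0] + mean[1])
    var = 0.25 * (cov[0][0] + cov[0][1] + cov[1][0] + cov[1][1])
    return m, var


def expected_improvement(m, var, best):
    """Expected improvement over `best` when maximizing."""
    sigma = math.sqrt(max(var, 0.0))
    if sigma == 0.0:
        return max(m - best, 0.0)  # no uncertainty: plain improvement
    z = (m - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (m - best) * cdf + sigma * pdf


if __name__ == "__main__":
    m, var = average_objective([1.0, 3.0], [[1.0, 0.5], [0.5, 2.0]])
    print(m, var, expected_improvement(m, var, best=2.0))
```

Note that a positive cross-covariance between the two rates increases the variance of the average, and with it the exploration bonus in EI.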

Consider the optimal gene design given by $x^* = \arg\max_{x \in \mathcal{X}} acq(x)$. Assume that we are interested
in the production of a certain protein whose sequence is $s_k$. To improve the sequence $s_k$ according to
the optimal design rules $x^*$ without changing the nature of the protein, we can interchange redundant
codons, that is, codons encoding the same amino acid. See Figure 1 (bottom left). Given a set of
sequences satisfying this criterion, we introduce an evaluation function to rank them in terms of their
coherence with the optimal design. In particular, we choose $eval(s \mid x^*) = \sum_{j=1}^{p} w_j |x_j - x_j^*|$, where
$s$ is a 'coherent' sequence, the $x_j$ are the features of $s$, the $x_j^*$ are the features of the optimal design, and
the $w_j$ are weights that we choose to be the inverse lengthscales of $K_{se}$. See Algorithm 1.
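
To illustrate the ranking step, the sketch below assumes the evaluation function is a weighted sum of absolute feature differences, with smaller values meaning closer agreement with the optimal design. The function names and the `(sequence, features)` pair layout are illustrative choices.

```python
def evaluation(features, target, weights):
    """Weighted absolute distance between candidate features and the target design."""
    return sum(w * abs(x - t) for x, t, w in zip(features, target, weights))


def rank_candidates(candidates, target, weights):
    """Sort (sequence, features) pairs from most to least coherent with the target."""
    return sorted(candidates, key=lambda c: evaluation(c[1], target, weights))


if __name__ == "__main__":
    target = [0.1, 0.9]      # features of the optimal design (toy values)
    weights = [2.0, 1.0]     # e.g. inverse lengthscales (toy values)
    candidates = [("seq_a", [0.0, 0.0]), ("seq_b", [0.1, 0.8])]
    for name, feats in rank_candidates(candidates, target, weights):
        print(name, evaluation(feats, target, weights))
```

Weighting by inverse lengthscales means that mismatches in features the model considers relevant are penalized more heavily.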


4 Gene Optimization in Mammalian Cells 

Figure 2: Top: Inverse lengthscales of the ARD component of the model for each of the biological
features used in the model. Bottom left: Optimal
design rules selected using EI in the context of the true performance of 3810 genes in mammalian
cells. Bottom right: Comparison of the true performance of 10 genes with the predicted performance
of recombinant genes selected among 1,000 randomly generated sequences. 95% confidence intervals
for the predictions are shown in blue.

In this experiment we use Bayesian optimization to design genes able to optimize the transcription
and translation rates of mammalian cells. We use a previously published dataset in which the rates of 3810
genes are available. The associated sequences were extracted from http://www.ensembl.org
using the first ENSEMBL identifier of the database. As features of the model, we used the frequency
of appearance of the 64 codons, together with the length of the gene, the GC-content, the AT-content,
the GC-ratio and the AT-ratio. We randomly sampled 1500 genes, which we used to train the model
described in Section 3. We fit the hyper-parameters of the model by the standard method of
maximizing the training-set marginal likelihood, using L-BFGS [9] for 1,000 iterations and selecting the best
of ten random restarts. We select the optimal gene design by means of the expected improvement,
which we optimize across all the available genes (both train and test sets) in order to have a way
of evaluating the coherence of the result with real experimental data. In Figure 2 (bottom left) we
show the scatter plot of the EI evaluated at all gene features vs. the true average of the ratios. The
best possible design is selected by the EI criterion. Next, we select 10 difficult-to-express genes by
choosing ten random genes among those whose average log ratio is smaller than 1.5. Taking their
sequences as references, we generated 1,000 random sequences (for each gene) able to encode the
same protein: all across the sequence, we replace each codon with a redundant one, which is
sampled uniformly from the set of codons encoding the same amino acid. Using the evaluation function
in Section 3.2, we ranked the sequences and selected the top-rated one. In Figure 2 (bottom right) we
show the true performance of each original sequence (experimental value) versus the predicted value of the best
recombinant sequence. In all ten cases the recombinant sequence outperforms the original one.
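
The candidate-generation step can be sketched as follows, assuming the standard genetic code (NCBI translation table 1). The helper names and the toy reference sequence are illustrative; the paper draws 1,000 variants per gene, while the usage example draws only a few.

```python
import random
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate a DNA string whose length is a multiple of 3 into amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def random_synonymous(dna, rng):
    """Replace every codon by one drawn uniformly from its synonymous set."""
    return "".join(
        rng.choice(SYNONYMS[CODON_TABLE[dna[i:i + 3]]])
        for i in range(0, len(dna), 3)
    )


if __name__ == "__main__":
    rng = random.Random(0)            # seeded for reproducibility
    reference = "ATGCTGAAATAA"        # toy gene: M, L, K, stop
    for _ in range(3):
        variant = random_synonymous(reference, rng)
        print(variant, translate(variant))
```

Every variant encodes the same protein as the reference, so the candidates differ only in the features that the evaluation function scores.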


5 Conclusions and Challenges 

We have shown that Bayesian optimization principles can be successfully used in the design of
synthetic genes. One of the most important aspects in this process is to have a good surrogate model
of the cell behavior that leads to appropriate acquisition functions. Considering future models,
the fact that the cell is an extremely complex system will be a key aspect to take into account. To
optimize certain features of the cell, massive amounts of data will be needed, which will require
the use of sparse Gaussian processes. Regarding the optimization aspects of the problem, in this
work we have worked with a set of features extracted from the gene sequences, which we have
used to obtain gene design rules rather than optimal sequences. The use of more features will
potentially lead to better and more specific gene designs. This will require, however, the development
of scalable Bayesian optimization methods able to work well in high dimensions, in line with some
recent works. An alternative approach is to focus directly on the optimization over the
sequences rather than on extracted features, omitting any previous biological knowledge. This
seems feasible from the modeling point of view by means of string or related kernels [8],
but the optimization of the acquisition functions derived from these models remains challenging.